The objective of the project is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix,classification_report
The dataset consists of several medical predictor variables (the independent variables) and one target variable, Outcome. The predictors are Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age.
df = pd.read_csv(r'C:\sample\diabetes.csv')  # raw string so the Windows path's backslashes are not treated as escapes
df
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
768 rows × 9 columns
Exploratory Data Analysis (EDA) is a step in the Data Analysis Process, where a number of techniques are used to better understand the dataset being used.
df.head()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
df.tail()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
df.sample(7)
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 330 | 8 | 118 | 72 | 19 | 0 | 23.1 | 1.476 | 46 | 0 |
| 223 | 7 | 142 | 60 | 33 | 190 | 28.8 | 0.687 | 61 | 0 |
| 566 | 1 | 99 | 72 | 30 | 18 | 38.6 | 0.412 | 21 | 0 |
| 136 | 0 | 100 | 70 | 26 | 50 | 30.8 | 0.597 | 21 | 0 |
| 307 | 0 | 137 | 68 | 14 | 148 | 24.8 | 0.143 | 21 | 0 |
| 401 | 6 | 137 | 61 | 0 | 0 | 24.2 | 0.151 | 55 | 0 |
| 55 | 1 | 73 | 50 | 10 | 0 | 23.0 | 0.248 | 21 | 0 |
df.shape
(768, 9)
Number of rows = 768
Number of columns = 9
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
df.describe()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
In the table above, the minimum value of the Glucose, BloodPressure, SkinThickness, Insulin, and BMI columns is zero. These measurements cannot physiologically be zero, so the zeros are treated as missing values and imputed with each column's mean.
df.isnull()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | False | False | False | False | False | False | False | False | False |
| 764 | False | False | False | False | False | False | False | False | False |
| 765 | False | False | False | False | False | False | False | False | False |
| 766 | False | False | False | False | False | False | False | False | False |
| 767 | False | False | False | False | False | False | False | False | False |
768 rows × 9 columns
df.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
df.isnull().sum().sum()
0
There are no NULL values in the given dataset.
df.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
print('No of zero values in Glucose',df[df['Glucose']==0].shape[0])
No of zero values in Glucose 5
print('No of zero values in BloodPressure',df[df['BloodPressure']==0].shape[0])
No of zero values in BloodPressure 35
print('No of zero values in SkinThickness',df[df['SkinThickness']==0].shape[0])
No of zero values in SkinThickness 227
print('No of zero values in Insulin',df[df['Insulin']==0].shape[0])
No of zero values in Insulin 374
print('No of zero values in BMI',df[df['BMI']==0].shape[0])
No of zero values in BMI 11
df['Glucose']=df['Glucose'].replace(0,df['Glucose'].mean())
print('No.of zero value in Glucose',df[df['Glucose']==0].shape[0])
No.of zero value in Glucose 0
df['BloodPressure']=df['BloodPressure'].replace(0,df['BloodPressure'].mean())
print('No.of zero value in BloodPressure',df[df['BloodPressure']==0].shape[0])
No.of zero value in BloodPressure 0
df['SkinThickness']=df['SkinThickness'].replace(0,df['SkinThickness'].mean())
print('No.of zero value in SkinThickness',df[df['SkinThickness']==0].shape[0])
No.of zero value in SkinThickness 0
df['Insulin']=df['Insulin'].replace(0,df['Insulin'].mean())
print('No.of zero value in Insulin',df[df['Insulin']==0].shape[0])
No.of zero value in Insulin 0
df['BMI']=df['BMI'].replace(0,df['BMI'].mean())
print('No.of zero value in BMI',df[df['BMI']==0].shape[0])
No.of zero value in BMI 0
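The five replace() calls above all follow the same pattern, so they can be collapsed into a single loop. A minimal sketch on a small synthetic frame (assumption: df_demo stands in for the loaded df, since the CSV lives on a local path):

```python
import pandas as pd

# Synthetic stand-in for the diabetes frame; 0 marks an impossible reading.
df_demo = pd.DataFrame({'Glucose': [148, 0, 183],
                        'BMI': [33.6, 26.6, 0.0]})

# One pass over every column whose zeros are really missing values.
# In the notebook this list would also include BloodPressure,
# SkinThickness, and Insulin.
zero_invalid = ['Glucose', 'BMI']
for col in zero_invalid:
    # The mean is computed before replacement, matching the cells above.
    df_demo[col] = df_demo[col].replace(0, df_demo[col].mean())

print(df_demo)
```

The zeros are still included when the mean is computed, exactly as in the per-column cells above; computing the mean over only the non-zero entries would be a slightly different (and arguably better) imputation.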
df.describe()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 121.681605 | 72.254807 | 26.606479 | 118.660163 | 32.450805 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 30.436016 | 12.115932 | 9.631241 | 93.080358 | 6.875374 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 44.000000 | 24.000000 | 7.000000 | 14.000000 | 18.200000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.750000 | 64.000000 | 20.536458 | 79.799479 | 27.500000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 79.799479 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
sns.heatmap() on df.isnull() visualizes the missing data in each variable (the plot is blank here, since the dataset has no nulls).
plt.figure(figsize=(25,25))
sns.heatmap(df.isnull())
<Axes: >
f,ax=plt.subplots(1,2,figsize=(10,5))
df['Outcome'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Outcome')
ax[0].set_ylabel('')
sns.countplot(x=df.Outcome)
plt.title("Count Plot for Outcome")
N,P = df['Outcome'].value_counts()
print('Negative(0):',N)
print('Positive(1):',P)
plt.grid()
plt.show()
Negative(0): 500
Positive(1): 268
Out of total 768 people, 268 are diabetic(positive(1)) and 500 are non-diabetic(negative(0)).
In the outcome column, 1 represents diabetes positive and 0 represents diabetes negative.
The count plot shows that the dataset is imbalanced: the number of patients without diabetes is considerably larger than the number with diabetes.
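One common mitigation for this imbalance is passing class_weight='balanced' to the classifiers used later (a hedged suggestion; it is not applied in this notebook). The weights that setting implies for the 500/268 split can be computed directly:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the dataset's 500 negative / 268 positive split
y_demo = np.array([0] * 500 + [1] * 268)

# 'balanced' assigns n_samples / (n_classes * count(class)),
# so the minority class gets proportionally more weight.
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y_demo)
print(weights)  # [0.768, ~1.433]
```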
Histograms are one of the most common graphs used to display numeric data.
They show how the data is distributed: whether it is roughly normal or skewed to the left or right.
df.hist(bins=10, figsize=(10,10))
plt.show()
A scatter plot is a diagram where each value in the data set is represented by a dot.
from pandas.plotting import scatter_matrix
scatter_matrix(df,figsize=(20,20));
Pairplot allows us to plot pairwise relationships between variables within a dataset.
sns.pairplot(data=df,hue='Outcome')
plt.show()
Correlation analysis quantifies the degree to which two variables are related. The correlation coefficient measures the linear relationship between them: how much one variable changes when the other does. Correlating each feature variable with the target variable shows how strongly the target depends on that feature.
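Besides the heatmap, the feature-to-target correlations can be listed directly with df.corr()['Outcome']. A sketch on synthetic data (assumption: the 'feature' column and threshold construction are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic frame: a continuous feature and a binary outcome derived
# from it with some noise, mimicking a feature-target relationship.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    'feature': x,
    'Outcome': (x + rng.normal(scale=0.5, size=200) > 0).astype(int),
})

# Pearson correlation of each column with the target, strongest first
corr_with_target = demo.corr()['Outcome'].drop('Outcome').sort_values(ascending=False)
print(corr_with_target)
```

On the real df, the same one-liner ranks Glucose, BMI, Age, and Pregnancies near the top, matching the heatmap reading below.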
corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(10,10))
g=sns.heatmap(df[top_corr_features].corr(),annot=True)
Observations: from the correlation heatmap, we can see a comparatively high correlation between Outcome and [Pregnancies, Glucose, BMI, Age, Insulin]. We can select these features to accept input from the user and predict the outcome.
X = df.drop('Outcome', axis=1)  # X holds the independent (feature) variables
y = df['Outcome']               # y holds the dependent (target) variable
classes = ['No', 'Yes']         # display labels: 0 = no diabetes, 1 = diabetes
print('Shape of X =', X.shape)
print('Shape of y = ', y.shape)
Shape of X = (768, 8)
Shape of y = (768,)
X
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.000000 | 79.799479 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85.0 | 66.0 | 29.000000 | 79.799479 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183.0 | 64.0 | 20.536458 | 79.799479 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89.0 | 66.0 | 23.000000 | 94.000000 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137.0 | 40.0 | 35.000000 | 168.000000 | 43.1 | 2.288 | 33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101.0 | 76.0 | 48.000000 | 180.000000 | 32.9 | 0.171 | 63 |
| 764 | 2 | 122.0 | 70.0 | 27.000000 | 79.799479 | 36.8 | 0.340 | 27 |
| 765 | 5 | 121.0 | 72.0 | 23.000000 | 112.000000 | 26.2 | 0.245 | 30 |
| 766 | 1 | 126.0 | 60.0 | 20.536458 | 79.799479 | 30.1 | 0.349 | 47 |
| 767 | 1 | 93.0 | 70.0 | 31.000000 | 79.799479 | 30.4 | 0.315 | 23 |
768 rows × 8 columns
y
0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=7)
print('Shape of X_train=', X_train.shape)
print('Shape of y_train=', y_train.shape)
print('Shape of X_test=', X_test.shape)
print('Shape of y_test=', y_test.shape)
Shape of X_train= (614, 8)
Shape of y_train= (614,)
Shape of X_test= (154, 8)
Shape of y_test= (154,)
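The split above is random but not stratified; with a roughly 65/35 class split, passing stratify=y keeps the class ratio nearly identical in train and test. A sketch on synthetic labels matching the dataset's 500/268 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the dataset's 500/268 class imbalance
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(768).reshape(-1, 1)

# stratify=y_demo keeps the positive rate (~0.349) in both splits
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo)

print(ytr.mean(), yte.mean())  # both close to 268/768
```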
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
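Note the asymmetry above: fit_transform() learns the mean and standard deviation from the training set only, and transform() reuses those statistics on the test set, which avoids leaking test-set information into training. A self-contained check on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train and X_test
rng = np.random.default_rng(1)
X_tr = rng.normal(loc=100, scale=15, size=(100, 3))
X_te = rng.normal(loc=100, scale=15, size=(30, 3))

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)  # statistics learned from train only
X_te_sc = scaler.transform(X_te)      # test reuses those same statistics

print(X_tr_sc.mean(axis=0).round(6))  # exactly ~0 for train by construction
print(X_te_sc.mean(axis=0).round(2))  # generally not exactly 0 for test
```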
from sklearn.ensemble import RandomForestClassifier
classifier_rf=RandomForestClassifier(n_estimators=100,criterion='gini')
classifier_rf.fit(X_train,y_train)
RandomForestClassifier()
rf_score = classifier_rf.score(X_test,y_test)
rf_score
0.8116883116883117
y_pred_rf = classifier_rf.predict(X_test)
y_pred_rf
array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
dtype=int64)
conf_matrix = confusion_matrix(y_test, y_pred_rf)
print(conf_matrix)
[[86 11]
 [18 39]]
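As a sanity check, the accuracy reported by classifier_rf.score() can be recomputed from the four confusion-matrix counts printed above:

```python
# Rows of the matrix are actual classes (0, 1); columns are predictions.
tn, fp, fn, tp = 86, 11, 18, 39
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(accuracy)  # 125/154 ≈ 0.8117, matching rf_score above
```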
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
xticklabels=classes, yticklabels=classes)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
from sklearn.metrics import classification_report
# Print a classification report that includes precision, recall, and F1-score
print("Classification Report:")
print(classification_report(y_test, y_pred_rf, target_names=classes))
Classification Report:
              precision    recall  f1-score   support

          No       0.83      0.89      0.86        97
         Yes       0.78      0.68      0.73        57

    accuracy                           0.81       154
   macro avg       0.80      0.79      0.79       154
weighted avg       0.81      0.81      0.81       154
from sklearn.model_selection import cross_val_score, KFold
# Create a k-fold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(classifier_rf, X_train_sc, y_train, cv=kf)
print("Cross-validation scores:", scores)
classifier_rf_kfold_mean_score = np.mean(scores)
print("Mean Accuracy:",classifier_rf_kfold_mean_score )
Cross-validation scores: [0.78861789 0.77235772 0.73170732 0.74796748 0.71311475]
Mean Accuracy: 0.7507530321204852
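Plain KFold ignores the class labels when splitting; with imbalanced data, StratifiedKFold keeps each fold's positive rate close to the overall 268/768 ≈ 0.349, which usually gives less noisy fold scores. A sketch on synthetic labels with the same split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in with the dataset's 500/268 imbalance
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.zeros((768, 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Positive rate inside each held-out fold
fold_ratios = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_ratios)  # each close to 268/768
```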
from sklearn.linear_model import LogisticRegression
classifier=LogisticRegression(solver='liblinear')
classifier.fit(X_train,y_train)
classifier.score(X_test,y_test)
0.8051948051948052
y_test_prediction=classifier.predict(X_test)
y_test_prediction
array([0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
dtype=int64)
conf_mat=confusion_matrix(y_test,y_test_prediction)
print(conf_mat)
[[92  5]
 [25 32]]
plt.figure(figsize=(12,6))
sns.heatmap(conf_mat,annot=True,fmt='d')
plt.title("Confusion Matrix of test data")
plt.xlabel("Predicted value")
plt.ylabel("Actual value")
Text(120.72222222222221, 0.5, 'Actual value')
print(classification_report(y_test,y_test_prediction))
              precision    recall  f1-score   support

           0       0.79      0.95      0.86        97
           1       0.86      0.56      0.68        57

    accuracy                           0.81       154
   macro avg       0.83      0.75      0.77       154
weighted avg       0.82      0.81      0.79       154
from sklearn.model_selection import cross_val_score, KFold
# Create a k-fold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(classifier, X_train_sc, y_train, cv=kf)
print("Cross-validation scores:", scores)
logistic_kfold_mean_score = np.mean(scores)
print("Mean Accuracy:",logistic_kfold_mean_score )
Cross-validation scores: [0.73170732 0.80487805 0.7804878  0.74796748 0.79508197]
Mean Accuracy: 0.7720245235239238
from sklearn.svm import SVC
svm_model = SVC(kernel='linear', C=1)
svm_model.fit(X_train_sc, y_train)
svm_model.score(X_test_sc,y_test)
0.7857142857142857
y_pred = svm_model.predict(X_test_sc)
y_pred
array([0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
dtype=int64)
conf_matrix = confusion_matrix(y_test, y_pred)
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=classes, yticklabels=classes)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
from sklearn.metrics import classification_report
# Print a classification report that includes precision, recall, and F1-score
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=classes))
Classification Report:
              precision    recall  f1-score   support

          No       0.78      0.92      0.84        97
         Yes       0.80      0.56      0.66        57

    accuracy                           0.79       154
   macro avg       0.79      0.74      0.75       154
weighted avg       0.79      0.79      0.78       154
from sklearn.model_selection import cross_val_score, KFold
# Create a k-fold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Perform k-fold cross-validation for linear SVM
scores = cross_val_score(svm_model, X_train_sc, y_train, cv=kf)
print("Cross-validation scores:", scores)
svm_kfold_mean_score = np.mean(scores)
print("Mean Accuracy:",svm_kfold_mean_score )
Cross-validation scores: [0.7398374  0.81300813 0.76422764 0.73170732 0.77868852]
Mean Accuracy: 0.7654938024790084
from sklearn.neighbors import KNeighborsClassifier
classifier_knn = KNeighborsClassifier(n_neighbors=5)
classifier_knn.fit(X_train,y_train)
KNeighborsClassifier()
knn_score = classifier_knn.score(X_test,y_test)
knn_score
0.7272727272727273
y_pred_knn = classifier_knn.predict(X_test)
y_pred_knn
array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
dtype=int64)
conf_matrix = confusion_matrix(y_test, y_pred_knn)
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=classes, yticklabels=classes)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
from sklearn.metrics import classification_report
# Print a classification report that includes precision, recall, and F1-score
print("Classification Report:")
print(classification_report(y_test, y_pred_knn, target_names=classes))
Classification Report:
              precision    recall  f1-score   support

          No       0.76      0.84      0.79        97
         Yes       0.66      0.54      0.60        57

    accuracy                           0.73       154
   macro avg       0.71      0.69      0.70       154
weighted avg       0.72      0.73      0.72       154
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(classifier_knn, X_train, y_train, cv=kf)
print("Cross-validation scores:", scores)
classifier_knn_kfold_mean_score = np.mean(scores)
print("Mean Accuracy:",classifier_knn_kfold_mean_score )
Cross-validation scores: [0.73170732 0.69918699 0.69105691 0.69918699 0.75409836]
Mean Accuracy: 0.7150473144075703
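n_neighbors=5 above is a default rather than a tuned value; scanning odd k with cross-validation is the usual way to choose it. A sketch on synthetic data (assumption: make_classification stands in for the scaled diabetes features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with 8 features, like the diabetes data
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Mean 5-fold CV accuracy for each odd k (odd avoids tie votes)
scores_by_k = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X_demo, y_demo, cv=5).mean()
               for k in range(1, 20, 2)}
best_k = max(scores_by_k, key=scores_by_k.get)
print(best_k, round(scores_by_k[best_k], 3))
```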
from xgboost import XGBClassifier
xgb_model = XGBClassifier(gamma=0)
xgb_model.fit(X_train, y_train)
XGBClassifier(gamma=0)
from sklearn import metrics
xgb_pred = xgb_model.predict(X_test)
print("Accuracy Score =", metrics.accuracy_score(y_test, xgb_pred))
Accuracy Score = 0.7792207792207793
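Tree ensembles expose per-feature importances through the sklearn-style feature_importances_ attribute, and XGBClassifier has the same attribute. Sketched here with RandomForestClassifier on synthetic data so the snippet runs even without xgboost installed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 8 features, 3 of them actually informative
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Importances sum to 1; argsort ranks features from most to least important
ranked = np.argsort(model.feature_importances_)[::-1]
print(ranked[:3])  # indices of the three most important features
```

On the fitted xgb_model, `xgb_model.feature_importances_` paired with `X.columns` would rank the actual diabetes features the same way.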
def visualize_data(X, y, preds, title):
plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0]['Glucose'], X[y == 0]['BMI'], label='No Diabetes', alpha=0.7)
plt.scatter(X[y == 1]['Glucose'], X[y == 1]['BMI'], label='Diabetes', alpha=0.7)
plt.scatter(X[preds == 1]['Glucose'], X[preds == 1]['BMI'], label='Predicted Diabetes', marker='x', c='red')
plt.title(title)
plt.xlabel('Glucose')
plt.ylabel('BMI')
plt.legend()
plt.show()
visualize_data(X_test, y_test, y_pred_knn, 'K-Nearest Neighbors Predictions')
visualize_data(X_test, y_test, y_pred, 'Support Vector Machine Predictions')
visualize_data(X_test, y_test, y_pred_rf, 'Random Forest Predictions')
visualize_data(X_test, y_test, xgb_pred, 'XGBoost')
visualize_data(X_test, y_test, y_test_prediction, 'Logistic Regression')
patient1 =[1,89,66,23,94,28.1,0.167,21]
patient1=np.array([patient1])
patient1
array([[ 1. , 89. , 66. , 23. , 94. , 28.1 , 0.167, 21. ]])
pred=classifier.predict(patient1)
if pred[0]==1:
print('Patient is diabetic')
else:
print('Patient is not diabetic')
Patient is not diabetic
C:\Users\admin\anaconda3\lib\site-packages\sklearn\base.py:420: UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
  warnings.warn(
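The UserWarning above appears because the model was fitted on a DataFrame but patient1 is a bare NumPy array. Wrapping the input in a DataFrame with the training column names silences it. A self-contained sketch (assumption: the two fitted rows are illustrative stand-ins for the real training data):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
        'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Minimal stand-in fit so the snippet runs on its own; in the notebook,
# 'classifier' is the LogisticRegression already fitted on X_train.
X_fit = pd.DataFrame([[1, 89, 66, 23, 94, 28.1, 0.167, 21],
                      [6, 148, 72, 35, 0, 33.6, 0.627, 50]], columns=cols)
clf = LogisticRegression(solver='liblinear').fit(X_fit, [0, 1])

# A DataFrame with matching column names triggers no feature-name warning
patient1 = pd.DataFrame([[1, 89, 66, 23, 94, 28.1, 0.167, 21]], columns=cols)
print('Patient is diabetic' if clf.predict(patient1)[0] == 1
      else 'Patient is not diabetic')
```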
Using all these patient records, we built machine-learning models to predict whether a patient has diabetes. The random forest performed best on the held-out test set (accuracy ≈ 0.81), with logistic regression a close second.